## ✨ FAL and FAL+ Integration Experiments

This repository has been extended to compare the proposed **FAL (First Attentions Last)** and **FAL+** mechanisms across three base attention types: MHA, GQA, and MoE. It adds 9 subfolders, one for each combination of base attention type and variant (baseline, FAL, or FAL+):

| Folder         | Architecture Description                                         |
| -------------- | ---------------------------------------------------------------- |
| `MHA/MHA`      | Vanilla Multi-Head Attention without FAL                         |
| `GQA/GQA`      | Grouped Query Attention (GQA) baseline                           |
| `MoE/MoE`      | Mixture-of-Experts (MoE) attention baseline                      |
| `MHA/FAL`      | Baseline Cramming with FAL (first attention re-routed to MLP)    |
| `MHA/FAL+`     | FAL+ variant: augments MHA–MLP connection instead of removing it |
| `GQA/FAL`      | GQA with FAL                                                     |
| `GQA/FAL+`     | GQA with FAL+                                                    |
| `MoE/FAL`      | MoE with FAL                                                     |
| `MoE/FAL+`     | MoE with FAL+                                                    |

These configurations are designed to isolate the effect of FAL-style attention routing within each attention mechanism. Every experiment follows the standard cramming setup (single GPU, 24-hour budget, MLM pretraining), so the architectures can be compared on equal terms.

### How to Run Each Variant

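Each folder contains a customized config and architecture variant. To run pretraining, execute the following, replacing `<EXPERIMENT_NAME>` with a run name and `<36~60>` with the number of transformer layers: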

```bash
python pretrain.py name=<EXPERIMENT_NAME> arch=crammed-large-izsak arch.num_transformer_layers=<36~60> train=bert-o4 data=pile-readymade budget=24 train.scheduler=budget-triangle2 train.batch_size_ramp=0.60
```
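
As a minimal sketch, several depths can be prepared from one shell loop. The layer counts 36, 48, and 60 and the `fal_L<layers>` run names are hypothetical choices for illustration, not values prescribed by this repository. The loop only collects and prints the commands, so they can be inspected before anything is launched.

```bash
#!/usr/bin/env bash
# Dry-run sketch: build the budgeted pretraining command for several depths.
# The layer counts and run names below are illustrative assumptions.
cmds=$(for layers in 36 48 60; do
  echo "python pretrain.py name=fal_L${layers} arch=crammed-large-izsak arch.num_transformer_layers=${layers} train=bert-o4 data=pile-readymade budget=24 train.scheduler=budget-triangle2 train.batch_size_ramp=0.60"
done)

# Print one command per line for review.
printf '%s\n' "$cmds"
```

Once the printed commands look right, they can be run one at a time from the chosen variant folder.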

To keep the schedule identical regardless of GPU performance, use the following command instead, which fixes the number of training steps (`train.steps=500_000`) and uses the step-based `triangle2` scheduler. (A total of 1.02B tokens will be ingested.)


```bash
python pretrain.py name=<EXPERIMENT_NAME> arch=crammed-large-izsak arch.num_transformer_layers=<36~60> train=bert-o4 data=pile-readymade budget=99 impl.microbatch_size=16 train.steps=500_000 train.scheduler=triangle2 train.batch_size_ramp=300_000
```
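
The 1.02B-token figure quoted for the fixed-step command can be checked with quick arithmetic. The sketch below assumes that each training step consumes one microbatch and that sequences are 128 tokens long. Neither assumption is stated in this README, so treat the result as a sanity check rather than a specification.

```bash
#!/usr/bin/env bash
# Rough token count for the fixed-step schedule.
steps=500000          # train.steps from the command above
microbatch_size=16    # impl.microbatch_size from the command above
seq_length=128        # ASSUMPTION: tokens per sequence, not given in this README

total_tokens=$(( steps * microbatch_size * seq_length ))
echo "total tokens: ${total_tokens}"   # 1024000000, i.e. about 1.02B
```

If the sequence length or the microbatch size differs in your config, the total scales proportionally.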

For MoE-based models, whose execution path depends on the input, set `impl.compile_torch=False` to avoid issues with dynamic behavior under `torch.compile`:

```bash
python pretrain.py name=<EXPERIMENT_NAME> arch=crammed-large-izsak arch.num_transformer_layers=<36~60> train=bert-o4 data=pile-readymade budget=99 impl.microbatch_size=16 train.steps=500_000 train.scheduler=triangle2 train.batch_size_ramp=300_000 impl.compile_torch=False
```
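
When scripting runs across several folders, the MoE-only override can be selected from the folder name. The helper below is a hypothetical sketch: `extra_flags_for` is not part of this repository, and it assumes variants are identified by the folder names from the table above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the extra override a variant folder needs, if any.
extra_flags_for() {
  case "$1" in
    MoE/*) printf '%s' "impl.compile_torch=False" ;;  # MoE variants: disable torch.compile
    *)     printf '%s' "" ;;                           # MHA and GQA variants: no extra override
  esac
}

variant="MoE/FAL+"
echo "extra overrides for ${variant}: $(extra_flags_for "$variant")"
```

Appending the helper's output to the fixed-step command keeps a single launch script usable for all nine variants.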